Jupyter IPython notebooks, such as this one, allow you to run both Python code and, using 'magics' also shell commands. In this tutorial we'll use both, since we will be interfacing with a variety of software, as well as processing data.
First, let's look around in the directory using standard Linux commands. We can execute a shell command by preceding it with an exclamation mark.
In [2]:
!ls -lh ../data/reads
We see that there are five files four of these are mutants, and and one reference original sample. We will take a look inside one of the files and look at the distribution of read statistics.
The reads are in text files, which have been compressed using gzip
, a common practice for storing raw data. You can look inside by decompressing a file, piping the output to a program called head
, which will stop after a few lines. You don't want to print the contents of the entire file to screen, since it will likely crash IPython.
In [3]:
!gunzip -c ../data/reads/mutant1_OIST-2015-03-28.fq.gz | head -8
Each read in the fastq file format has four lines, one is a unique read name, one containing the sequence of bases, one +
, and one containing quality scores. The quality scores correspond to the sequencer's confidence in making the base call.
It is good practice to examine the quality of your data before you proceed with the analysis. We'll use a popular tools called FastQC to do some exploratory analysis.
In [9]:
!fastqc ../data/reads/mutant1_OIST-2015-03-28.fq.gz
In [10]:
from IPython.display import IFrame
IFrame('../data/reads/mutant1_OIST-2015-03-28_fastqc.html', width=1000, height=1000)
Out[10]:
We can explore the contents of read files programmatically using a library within Python called Biopython. This allows to automate many tedious tasks.
In [6]:
import gzip
from Bio import SeqIO
with gzip.open("../data/reads/mutant1_OIST-2015-03-28.fq.gz", 'rt') as infile: # open and decompress input file
for rec in SeqIO.parse(infile, "fastq"): # start looping over all records
print(rec) #print record contents
break # stop looping, we only want to see one record
You can see the methods associated with each object, suce as rec
usig the dir command.
In [6]:
print(dir(rec)) # print methods associaat
For example, we can reverse complement the sequence:
In [7]:
rec.reverse_complement()
Out[7]:
There are lots of other interesting functions to explore!
Exercises should be done in Python or bash.
reads/mutant1_OIST-2015-03-28.fq.g
named M00923:134:000000000-A5ELA:1:2109:24002:5853
. What is the average error rate? How many errors can we expect per read?